(57) Abstract



In computerized processing of natural-language medical/clinical data, including phrase parsing and regularizing, reference is made
to parameters whose values can be specified by the user. Thus, a computerized system can be provided with versatility, e.g., for
processing data originating in diverse domains. In addition to a parser (12) and a regularizer (13), the system includes a
preprocessor (11), output filters (14) and an encoding mechanism (15).

Description

System and Method for
Medical Language Extraction and Encoding

Background of the Invention

This invention relates to natural language processing and, more 
specifically, to computerized processing of natural-language phrases found in 
medical/clinical data. 

Clinical information as expressed by health care personnel is typically
provided in natural language, e.g., in English. But, while phrases in natural language
are convenient in interpersonal communication, the same typically does not apply to
computerized applications such as automated quality assurance, clinical decision
support, patient management, outcome studies, administration, research and literature
searching. Even where clinical data is available in electronic or computer-readable
form, the data may remain inaccessible to computerized systems because of its form
as narrative text.

For computerized applications, methods and systems have been
developed for producing standardized, encoded representations of clinical information
from natural-language sources such as findings from examinations, medical history,
progress notes, and discharge summaries. Special-purpose techniques have been used
in different domains, e.g., general and specialized pathology, radiology, and surgery
discharge reports.

Of particular further interest is a general approach which is based on
concepts and techniques described in the following papers:

C. Friedman et al., "A Conceptual Model for Clinical Radiology
Reports". In: C. Safran, ed., Seventeenth Symposium for Computer Applications in
Medical Care. New York, McGraw-Hill, March 1994, pp. 829-833;

C. Friedman et al., "A General Natural-Language Text Processor for
Clinical Radiology", Journal of the American Medical Informatics Association. Vol. 1
(April 1994), pp. 161-174;



WO 98/19253    PCT/US97/19362

C. Friedman et al., "A Schema for Representing Medical Language
Applied to Clinical Radiology", Journal of the American Medical Informatics
Association. Vol. 1 (June 1994), pp. 233-248;

C. Friedman et al., "Natural Language Processing in an Operational
Clinical Information System", Natural Language Engineering. Vol. 1 (March 1995),
pp. 83-106.

Summary of the Invention

A preferred method for computerized processing of natural-language
medical/clinical data includes basic steps here designated as phrase parsing and
regularizing and, optionally, code selection. Further included, preferably, are a step of
pre-processing prior to phrase parsing, and a step of output filtering. Output can be
generated in the form of a printout, as a monitor display, as a database entry, or via
the "information highway", for example.

In processing, one or several parameters are referred to. The
parameters are associated with options. To choose an option, the appropriate value is
assigned to the parameter. A parameter can have a value by default. Of particular
importance is the inclusion of a parameter which is associated with the
medical/clinical domain or subfield of the input data. Other parameters may be
associated with the level of parsing accuracy desired, whether code selection is
desired, the type of filtering, or the format of the output.
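
By way of illustration only, the parameter mechanism can be sketched as follows, in Python rather than in the Prolog of the Appendix. The field names loosely follow the parameter names given further below in this description (Exam, Mode, Amount, Type, Protocol); the default values and the function name set_parameters are assumptions made for this sketch.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Parameters:
    # Default values are illustrative assumptions, not taken from the Appendix.
    exam: str = "radiology"      # medical/clinical domain of the input data
    mode: str = "full"           # desired level of parsing accuracy
    amount: str = "all"          # type of output filtering
    output_type: str = "lines"   # format of the output
    protocol: str = "plain"      # "html" or "plain"


def set_parameters(user_values: dict) -> Parameters:
    """Start from the defaults and override only what the user specified."""
    unknown = set(user_values) - set(Parameters.__dataclass_fields__)
    if unknown:
        raise ValueError(f"unknown parameter(s): {sorted(unknown)}")
    return replace(Parameters(), **user_values)


params = set_parameters({"exam": "mammography", "protocol": "html"})
print(params.exam, params.mode, params.protocol)
```

A parameter that the user does not mention simply keeps its default, which corresponds to the statement that a parameter can have a value by default.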

The method can be expressed in a high-level computer language such
as Prolog, for example, for execution as a system on a suitable general-purpose
computer. In the following, the method and the system will be referred to by the
acronym MedLEE, short for Medical Language Extraction and Encoding.

Brief Description of the Drawing

Fig. 1 is a diagram of the MedLEE system or "server".

Fig. 2 is a diagram of a system or application which has an interface
for MedLEE.

An Appendix hereto includes a printout of computer source code for a
portion of MedLEE.

Detailed Description of Preferred Embodiments

A natural-language phrase included in medical/clinical data is
understood as a delimited string comprising natural-language terms or words. The
string is computer-readable as obtained, e.g., from a pre-existing database, or from
keyboard input, optical scanning of typed or handwritten text, or processed voice
input. The delimiter may be a period, a semicolon, an end-of-message signal, a
new-paragraph signal, or any other suitable symbol recognizable for this purpose. Within
the phrase, the terms are separated by another delimiter, e.g., a blank or another
suitable symbol.
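
The two levels of delimiting can be illustrated by a minimal Python sketch. The particular delimiter sets and the function name split_phrases are assumptions; in particular, the sketch treats every period as a phrase delimiter and so would mis-split abbreviations containing periods, which a practical system would have to handle beforehand.

```python
import re
from typing import List

# Assumed delimiter sets; the description allows any suitable symbols.
PHRASE_DELIMITERS = r"[.;\n]+"
TERM_DELIMITERS = r"\s+"


def split_phrases(text: str) -> List[List[str]]:
    """Return one list of terms per delimited phrase, skipping empty phrases."""
    phrases = []
    for raw in re.split(PHRASE_DELIMITERS, text):
        terms = [t for t in re.split(TERM_DELIMITERS, raw.strip()) if t]
        if terms:
            phrases.append(terms)
    return phrases
```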

As a result of phrase parsing, terms in a natural-language phrase are
classified, e.g., as referring to a body part, a body location, a clinical condition or a
degree of certainty of a clinical condition, and the relationships between the terms are
established and represented in a standard form. For example, in the phrase "moderate
cardiac enlargement", "moderate" is related to "enlargement" and "cardiac" is also
related to "enlargement".
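
As a toy illustration of classifying terms and relating them to one another, the following Python sketch handles the example phrase. The three-entry lexicon, the class names ("degree", "bodyloc", "finding") and the output structure are hypothetical; they are not the grammar or target form of the Appendix.

```python
# Hypothetical mini-lexicon: term -> semantic class.
LEXICON = {
    "moderate": "degree",
    "cardiac": "bodyloc",
    "enlargement": "finding",
}


def parse_phrase(terms):
    """Classify each term and attach the modifiers to the finding they relate to."""
    finding = None
    modifiers = {}
    for term in terms:
        semantic_class = LEXICON.get(term.lower())
        if semantic_class is None:
            return None  # unknown term: this toy parser gives up
        if semantic_class == "finding":
            if finding is not None:
                return None  # more than one finding is beyond this sketch
            finding = term.lower()
        else:
            modifiers[semantic_class] = term.lower()
    if finding is None:
        return None
    return {"finding": finding, "modifiers": modifiers}
```

Both "moderate" and "cardiac" end up as modifiers of the single finding "enlargement", mirroring the relationships stated in the example.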

In the interest of versatility and applicability of the system to different
domains, parsing is domain specific as a function of the value assigned to a parameter
which the system refers to in parsing. Depending on the value of the domain
parameter, the appropriate rules can be referred to in parsing by the system.

While parsing may be based primarily on semantics or meaning, use of 
syntactic or grammatical information is not precluded. 

Regularizing involves bringing together terms which may be
discontiguous in a natural-language phrase but which belong together conceptually.
Regular forms or composites are obtained. Regularizing may involve reference to a
separate knowledge base. For example, from each of the phrases "heart is enlarged",
"enlarged heart", "heart shows enlargement" and "cardiac enlargement", a regularizer
can generate "enlarged heart".
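
The example can be reproduced with a small Python sketch. The mapping table and the function name regularize are assumptions, and for brevity the sketch works directly on the raw phrase rather than on parsed forms.

```python
# Hypothetical mapping knowledge: surface term -> (role, canonical form).
MAPPING = {
    "heart": ("bodypart", "heart"),
    "cardiac": ("bodypart", "heart"),
    "enlarged": ("finding", "enlarged"),
    "enlargement": ("finding", "enlarged"),
}


def regularize(phrase: str):
    """Compose '<finding> <bodypart>' regardless of word order or linking words."""
    roles = {}
    for term in phrase.lower().split():
        if term in MAPPING:
            role, canonical = MAPPING[term]
            roles[role] = canonical
    if "finding" in roles and "bodypart" in roles:
        return f"{roles['finding']} {roles['bodypart']}"
    return None
```

Linking words such as "is" and "shows" are absent from the table and are therefore ignored, so terms that are not adjacent in the phrase are still brought together.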

In code selection, which is optional, a common, unique vocabulary
term or code is assigned to each regular term by reference to yet another knowledge
base, which may also be domain-specific. For example, in the domain of X-ray
diagnostics, the term "cystic disease" has a different meaning than in the
domain of mammography.
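
Domain-specific code selection amounts to a lookup keyed by both the domain and the regular term, as in the following Python sketch. The domain keys and the codes are placeholders invented for illustration; they do not come from any real controlled vocabulary.

```python
# Placeholder codes: invented for illustration, not from any real vocabulary.
CODE_TABLES = {
    "xray": {"cystic disease": "XRAY-0001"},
    "mammography": {"cystic disease": "MAMMO-0001"},
}


def select_code(domain, regular_term):
    """Look up the code for a regular term in the table of the given domain."""
    return CODE_TABLES.get(domain, {}).get(regular_term)
```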

Fig. 1 shows a preprocessor module 11 by which natural-language
input text is received. The preprocessor uses the lexicon knowledge base 101 and
handles abbreviations, which may be domain dependent. With the domain parameter
properly set, the preprocessor refers to the proper knowledge base. For example,
depending on the domain, the abbreviation "P.E." can be understood as physical
examination or as pleural effusion. Also, the preprocessor determines phrase or
sentence boundaries, and generates a list form for each phrase for further processing
by the parser module 12.
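
Domain-dependent handling of abbreviations can be sketched in Python as follows. Which reading of "P.E." belongs to which domain is not stated here, so the assignment in the table, like the domain names themselves, is an assumption.

```python
# Which reading belongs to which domain is an assumption made for this sketch.
ABBREVIATIONS = {
    "radiology": {"P.E.": "pleural effusion"},
    "discharge_summary": {"P.E.": "physical examination"},
}


def expand_abbreviations(text, domain):
    """Replace each known abbreviation with its expansion for the given domain."""
    for abbreviation, expansion in ABBREVIATIONS.get(domain, {}).items():
        text = text.replace(abbreviation, expansion)
    return text
```

Expanding abbreviations before phrase boundaries are determined also keeps the periods in "P.E." from being mistaken for phrase delimiters.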

The parser module 12 also uses the lexicon 101, and a grammar
module 102, to generate intermediate target forms. In addition to parsing of
complete phrases, subphrase parsing can be used to advantage where highest accuracy
is not required. In case a phrase cannot be parsed in its entirety, one or several
attempts can be made to parse a portion of the phrase for obtaining useful information
in spite of some uncertainty. For example, subphrase parsing can be used in
surveying discharge summaries.
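
The fallback from full-phrase parsing to subphrase parsing can be sketched in Python as follows. How portions of a phrase are actually selected is not described here; trying the longest contiguous spans first, from left to right, is an assumption, as are the function and argument names.

```python
def parse_with_fallback(terms, parse, allow_subphrases=True):
    """Try the whole phrase first; on failure, try shorter contiguous spans.

    `parse` is any callable returning a parse result or None on failure.
    Longer spans are tried before shorter ones, left to right.
    """
    result = parse(terms)
    if result is not None or not allow_subphrases:
        return result
    for length in range(len(terms) - 1, 0, -1):
        for start in range(len(terms) - length + 1):
            result = parse(terms[start:start + length])
            if result is not None:
                return result
    return None
```

Setting allow_subphrases to False corresponds to a mode in which only complete parses are accepted, i.e., where highest accuracy is required.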

With the parsed forms as input, and using mapping information 103,
the phrase regularizer 13 composes regular terms as described above.

From the regularized phrases, the filter module 14 deletes information
on the basis of parameter settings. For example, a parameter can be set to call for
removal of negative findings.

The encoder module 15 uses a table of codes 104 to translate the
regularized forms into unique concepts which are compatible with a clinical
controlled vocabulary.

Fig. 2 shows an interface module 21, and the MedLEE system 22 of
Fig. 1. The interface module 21 may be domain-specific, and it may serve, e.g., to
separate formatted sections from non-formatted sections in a report. Also, the
interface module 21 may serve to pass chosen parameter values to the MedLEE system 22 and
to pass output from the MedLEE system. For example, such an interface can be
designed for communication over the World-Wide Web or a local network, for input
to or output from MedLEE.

Conveniently, each module is software-implemented and stored in
random-access memory of a suitable computer, e.g., a work-station computer. The
software can be in the form of executable object code, obtained, e.g., by compiling
from source code. Source code interpretation is not precluded. Source code can be in
the form of sequence-controlled instructions as in Fortran, Pascal or "C", for example.
Alternatively, a rule-based system can be used such as Prolog, where suitable
sequencing is chosen by the system at run-time.
An illustrative portion of the MedLEE system is shown in the
Appendix in the form of a Prolog source listing with comments. The following is
further to the comments.

Process_sents with get_inputsents, process_sects and outputresults
reads in an input stream, processes sections of the input stream according to parameter
settings, and produces output according to the settings. Among parameters supplied
to Process_sents are the following: Exam (specifying the domain), Mode (specifying
the parsing mode), Amount (specifying the type of filtering), Type (specifying the
output format) and Protocol (html or plain). Process_sents is called by another
predicate, after user-specified parameters have been processed.
Process_sects with get_section and parse_sentences gets each section
and generates intermediate output for the sentences in each section.

Outputresults with removefromtarg, write, writelines, markupsents and
outputhl7 filters output if appropriate, produces output in the appropriate format,
optionally including format tags for selected words of the original sentence, and
produces error messages and an end-of-output message.

Setargs sets arguments or parameter values based on user input or by
default.

Removefromtarg filters formatted output by leaving only positive
clinical information and by removing negative findings from the formatted output.
Another parameter, pare, removes findings associated with past information from the
formatted output. Any number of different filters can be included as suitable.
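
Output filtering of this kind can be sketched in Python as follows. The representation of a finding as a dictionary, the field names "certainty" and "status", and their values are hypothetical and do not reflect the formatted output of the Appendix.

```python
def filter_findings(findings, remove_negative=False, remove_past=False):
    """Drop findings according to the filter settings; keep everything else."""
    kept = []
    for finding in findings:
        if remove_negative and finding.get("certainty") == "no":
            continue
        if remove_past and finding.get("status") == "past":
            continue
        kept.append(finding)
    return kept
```

Each additional filter is one more independent test in the loop, which is consistent with the remark that any number of filters can be included.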

Writelines produces one line per finding, in list format.
Writeindentform and writeindentform2 produce output in indented form.

Markupsents envelops the original sentence with tags so that the
clinical information is highlighted. Different types of information can be highlighted
in different colors by use of an appropriate browser program such as Netscape, for
example.
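
Tagging of this kind can be sketched in Python as follows. The use of span elements with a class attribute, to which a style sheet could assign colors, is an assumption made for the sketch, as is the function name markup_sentence; the tags actually produced by Markupsents are those of the Appendix.

```python
import html
import re


def markup_sentence(sentence, highlights):
    """Wrap each highlighted term in a <span> whose class names its type.

    `highlights` maps a term to a type name, e.g. {"effusion": "finding"}.
    """
    lookup = {term.lower(): kind for term, kind in highlights.items()}
    if not lookup:
        return html.escape(sentence)
    # Longest terms first so that a longer term wins over a term it contains.
    terms = sorted(lookup, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(t) for t in terms), re.IGNORECASE)
    pieces, last = [], 0
    for match in pattern.finditer(sentence):
        pieces.append(html.escape(sentence[last:match.start()]))
        kind = lookup[match.group(0).lower()]
        pieces.append(f'<span class="{kind}">{html.escape(match.group(0))}</span>')
        last = match.end()
    pieces.append(html.escape(sentence[last:]))
    return "".join(pieces)
```

Text outside the highlighted terms is escaped so that characters such as "<" in the original sentence cannot be mistaken for tags.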

Outputhl7 converts output to a form appropriate for a database
(xformtodb) and writes out that form in HL7, in coded format. This process uses
synonym knowledge and an encoding knowledge base.

APPENDIX

Claims

1. A computer method for processing medical/clinical data comprising a
natural-language phrase,

the method comprising parsing the natural-language phrase and
regularizing the parsed phrase,

wherein parsing comprises referring to a domain parameter whose
value is indicative of a medical/clinical domain from which the data originated.

2. The method according to claim 1, further comprising preprocessing the
data prior to parsing, with preprocessing comprising referring to the domain
parameter.

3. The method according to claim 1, further comprising encoding at least
one term of the regularized phrase, with encoding comprising referring to the domain
parameter.

4. The method according to claim 1, further comprising filtering the
regularized phrase.

5. The method according to claim 1, further comprising referring to an
additional parameter which is indicative of the degree to which subphrase parsing is to
be carried out.

6. The method according to claim 1, further comprising referring to an
additional parameter which is indicative of desired filtering.

7. The method according to claim 1, further comprising referring to an
additional parameter which is indicative of a desired type of output.

8. The method according to claim 1, further comprising referring to an
additional parameter which is indicative of a desired output format.

9. A computer system for processing medical/clinical data comprising a
natural-language phrase,

the system comprising means for parsing the natural-language phrase
and means for regularizing the parsed phrase,

wherein the parsing means comprises means for referring to a domain
parameter whose value is indicative of a medical/clinical domain from which the data
originated.

10. The system according to claim 9, further comprising means for
preprocessing the data prior to parsing, with the preprocessing means comprising
means for referring to the domain parameter.

11. The system according to claim 9, further comprising means for
encoding at least one term of the regularized phrase, with the encoding means
comprising means for referring to the domain parameter.

12. The system according to claim 9, further comprising means for
filtering the regularized phrase.

13. The system according to claim 9, further comprising means for
referring to an additional parameter which is indicative of the degree to which
subphrase parsing is to be carried out.

14. The system according to claim 9, further comprising means for
referring to an additional parameter which is indicative of desired filtering.

15. The system according to claim 9, further comprising means for
referring to an additional parameter which is indicative of a desired type of output.

16. The system according to claim 9, further comprising means for
referring to an additional parameter which is indicative of a desired output format.

17. A combination of the system according to claim 9 with an interface
module for enabling the system to receive input from and/or to produce standardized
output for the World-Wide Web and/or a local network.

18. The combination according to claim 17, further comprising means for
viewing the output using a standardized browser.

19. The combination according to claim 18, wherein the browser is a Web-